SiSoftware Sandra Q & A - CPU Benchmark

SiSoftware Sandra - The Diagnostic Tool, Q & A - CPU Benchmark

This document provides some frequently asked questions about Sandra. Please read the Help File as well!

Q: What is the Dhrystone benchmark?
A: The original Dhrystone benchmark is still widely used in industry, in various versions/variants, to measure CPU performance. The benchmark is designed to contain a representative sample of the types of operations, mostly numerical, used by applications. Unfortunately, this does not always reflect real-life performance, but it is useful for comparing the speed of various CPUs.

The Dhrystone benchmark used here is a multi-threaded, 32/64-bit variant of the original one which runs under UNIX. Up to 64 CPUs in SMP systems are supported. The result is determined by measuring the time it takes to perform some sequences of instructions. Due to various changes, the result is not directly comparable with other Dhrystone benchmarks. However, the MIPS (Million Instructions Per Second) figure should be the same for the same system (within +/-5-10%) between benchmarks.

While the original benchmark does not compute anything, this version does check the results with the expected ones just in case there are problems with the CPU/memory.
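As a rough illustration of the two ideas above (timing a fixed sequence of operations, then checking the result against an expected value), here is a minimal Python sketch. It is not Sandra's Dhrystone code; the loop body, the function name `timed_integer_loop` and the iteration count are assumptions made for the example.

```python
import time

def timed_integer_loop(iterations=200_000):
    """Time a fixed sequence of integer operations and verify the result."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total = (total + i * 3) & 0xFFFFFFFF  # keep the value within 32 bits
    elapsed = time.perf_counter() - start

    # Closed-form expected value: 3 * (0 + 1 + ... + iterations-1), wrapped to 32 bits.
    expected = (3 * iterations * (iterations - 1) // 2) & 0xFFFFFFFF
    ok = total == expected  # a mismatch would hint at a CPU/memory problem

    # Loop iterations per second (not real MIPS; only a relative rate).
    rate = iterations / elapsed if elapsed > 0 else float("inf")
    return rate, ok

if __name__ == "__main__":
    rate, ok = timed_integer_loop()
    print(f"{rate:,.0f} iterations/second, result check passed: {ok}")
```

The rate printed here is only meaningful when compared with runs of the same sketch on other systems, which mirrors the point about results not being comparable between different Dhrystone variants.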

Q: What is the Whetstone benchmark?
A: The Whetstone benchmark is widely used in the computer industry as a measure of FPU or Co-Processor performance. Floating-point arithmetic is most significant in programs that require a Co-Processor. These are mostly scientific, engineering, statistical and computer-aided design programs.

The Whetstone benchmark used here is a multi-threaded, 32/64-bit variant of the original one which runs under UNIX. Up to 64 CPUs in SMP systems are supported. The result is determined by measuring the time it takes to perform some sequences of floating-point instructions. Due to various changes, the result is not directly comparable with other Whetstone benchmarks. However, the MFLOPS (Million FLoating OPerations per Second) figure should be the same for the same system (within +/-5-10%) between benchmarks.

Q: What is the SSE2 Whetstone benchmark?
A: With the introduction of SSE2 and its support for double floats (64-bit) it is now possible to write code that does not use the legacy FPU at all. This version shows that the full Whetstone benchmark can be implemented using SSE2 and thus take advantage of the SIMD mode of operation.

Q: Why does the rating vary between sequential runs?
A: On most systems, the value of the rating shouldn't change by more than about +/-5% from run to run. On systems with limited memory it may vary by +/-10% due to memory swapping. If you're seeing variations higher than this, some hardware or software is probably to blame. Do note that "limited memory" depends on operating system, installed drivers, running programs, etc. A badly configured system with 64MB may be worse memory-wise than a 16MB system...
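To check a set of your own results against the +/-5% and +/-10% figures above, a small helper along these lines may be useful. This is only a sketch: the function names and the choice of measuring deviation from the mean of the runs are assumptions, not something Sandra does for you.

```python
def spread_percent(scores):
    """Largest deviation from the mean, as a percentage of the mean."""
    mean = sum(scores) / len(scores)
    return max(abs(s - mean) for s in scores) / mean * 100.0

def classify(scores, limited_memory=False):
    """Compare the spread with the +/-5% (or +/-10% on low-memory systems) guideline."""
    limit = 10.0 if limited_memory else 5.0
    return "normal" if spread_percent(scores) <= limit else "investigate"

if __name__ == "__main__":
    runs = [1000, 1020, 980]          # made-up ratings from three sequential runs
    print(spread_percent(runs), classify(runs))
```

If the result is "investigate", work through the software and hardware causes listed in this answer.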

Software Causes:

Do not move the mouse! Playing with the mouse/rodent (!) will affect the result considerably.
Do not play MIDI, Wave, CD or other music playing/generating programs. AVI, MPEG, DVD or other type of movies (or TV) are out too...
Other programs running. Close all programs. If this doesn't solve the problem, check for programs that are loaded in the Startup group, WIN.INI (load and run lines) or in the AUTOEXEC.BAT file.
Problematic device drivers. Some drivers which are not correctly configured or incompatible may slow-down the system. Also some drivers "poll" the system for various reasons, e.g., the CD-ROM auto-run feature.
Power-saving options are turned on (APM). These options may cause the CPU to automatically slow down after even a few seconds of inactivity. Often, "inactivity" is defined as nobody typing on the keyboard or moving the mouse, even if a program is working away.
If you're benchmarking a SMP system, some variation may occur due to the synchronising algorithm in the benchmark. Since the work done by threads must be synchronised, there is a small overhead which may be more apparent in some systems.

Hardware Causes:

Insufficient memory for program to function properly. Close all programs. Also, you may try unloading some drivers which you don't need anymore.
Insufficient secondary (L2/L3) cache memory or poor cache controller. Some old 486 mainboards do not have enough cache (tag) to cache all the system RAM (core). Others have limited logic to cache only 8-16MB. Even many new ones cannot cache above 64MB. Evidently, at the low-price end of the mainboard market there are Pentium boards which behave the same way. Programs which reside in un-cached memory (or are moved there by Windows) will run very slowly. Usually you need: 256KB L2 cache for up to 32MB, 512KB cache for up to 128MB and 1MB cache for anything larger. The bigger the cache the faster the machine, so if you can specify a larger cache do so.

Q: Why does the rating vary between random runs?
A: You may find that running the program straight after Windows loads gives a higher benchmark rating than running it after other programs have been run and closed. This should not happen very often in practice, but it sometimes does.

Software Causes:

Programs which do not clean up after closing down. Crashed programs leave "orphaned" objects which take up system resources. While Windows does garbage-collecting, sometimes you just have to restart...
Memory fragmentation. As programs load and close down, memory is allocated and freed, and it may remain fragmented afterwards. Again, for various reasons de-fragmentation does not yield 100% results.

Hardware Causes:

Insufficient secondary (L2/L3) cache memory or poor cache controller. Some old 486 mainboards do not have enough cache (tag) to cache all the system RAM. Others have limited logic to cache only 8-16MB. Evidently, at the low-price end of the mainboard market there are Pentium boards which behave the same way. Programs which reside in un-cached memory (or are moved there by Windows) will run very slowly.

Q: The benchmark scores in Sandra 2002/2001/2000/99 are different from earlier versions!
A: The benchmarks change from release to release in order to keep up with new technology developments. Please compare results using the same version of Sandra.

Q: Why does my P4 Hyper-Threaded/SMT system do so badly?
A: Please update to Sandra 2002 (8.59) or later for SMT support. Earlier versions had issues with data alignment on SMT systems. Please note that later versions of Sandra will have better support for SMT for better performance.

Q: What is this Parallel Execution mode?
A: The parallel execution mode was designed to extract better performance out of Hyper-Threaded/SMT systems by running threads that use different instruction sets and thus use different units in the CPU to result in better resource utilisation. This results in better overall performance although local performance may well decrease in some instances.

Q: Should Parallel Execution be on or off on non-SMT systems?
A: The parallel execution mode should have negligible impact on non-SMT systems, both uni-processor and SMP systems (only thread overhead). Thus it does not really matter whether it is on or off. For compatibility you can turn it off and use sequential execution mode.

Q: While my P4 Hyper-Threaded/SMT system does well in Whetstone FPU/SSE it does not do much better on Dhrystone! Why?
Q: While my P4 Hyper-Threaded/SMT system does well in Multi-Media Float SSE2 it does not do much better in Multi-Media Integer SSE2! Why?
A: The FPU units were under-utilised in the original P4 thus they get good improvement in SMT (~50%); the ALUs (even double-clocked) were better utilised and gave great performance already; thus they get some improvement only (~20%). The parallel execution mode in Sandra was designed to use both ALU & FPU units in different threads and thus get better overall performance.

Q: Why does my SMP Athlon MP system do so badly on Dhrystone?
A: Please update to Sandra 2002 (8.59) or later for full support. Earlier versions had issues with false data sharing on Athlon SMP systems.

Q: Why does my Pentium 4 do so badly on FPU Whetstone?
A: The Pentium 4 is optimised for SSE2 and not for legacy FPU code. The SSE2 Whetstone shows the performance improvements once SSE2 is used. Almost all new software will be SSE2 optimised and thus run much faster.

Q: Why do Intel CPUs (e.g. Pentium III) score higher than other CPUs (e.g. Athlon) with more advanced FPUs in Whetstone?
A: The Whetstone benchmark uses the slowest functions of the FPU (computing transcendentals, e.g. sin/cos/tan) in a way that cannot be parallelised (one serial chain). This was done in order to prevent cheating by manufacturers as much as possible. Unfortunately, this means that features like out-of-order execution, pipelining, etc. are bypassed. These are paramount to today's processors but that's not what is tested here.

Thus, on processors where transcendental instructions have been optimised, the benchmark index will be much higher. Other CPU manufacturers (e.g. AMD) have chosen to optimise other instructions - which they consider more widely used than transcendentals - thus the benchmark index on such CPUs (e.g. K6 series, Athlon) will be lower even if the FPU is more advanced.

Q: I get abnormally high scores on Windows 2000 on my Athlon system. What's up?
A: Please update to SP2 or later. This resolves a timer chip issue that affects some boards and fast CPUs.

Q: My SMP system scores the same or lower than a single CPU system! What's wrong?
A: Here are the most likely causes:

Make sure you use Sandra 2001 or later. Earlier versions do not support SMP.
Make sure you use Windows NT4/2000/XP/.Net. Windows 9X/Me do not support SMP systems.
Make sure you're using the SMP kernel. You should see all your CPUs in task manager for example.
If you're using the SMP ACPI kernel, make sure each CPU can go up to 100%. Some BIOSes contain bugs that do not allow full usage of all CPUs (e.g. old Asus P2B-D).
Make sure no background processes are using the CPU(s), i.e. utilisation is 0% before you start the benchmark.

Q: Why does the benchmark support only 64 CPUs?
A: Windows 32-bit NT/2000/XP supports a maximum of 32 CPUs - only Windows 64-bit .Net supports up to 64 CPUs.

Q: Why does the benchmark support only 2 SMT units per CPU?
A: In theory, any number of SMT units per CPU is supported, however, the parallel execution mode is optimised for 2 SMT units per CPU.

Q: Why does the benchmark use as many threads as CPUs?
A: A proper CPU benchmark is CPU bound, so one thread per CPU should already use 100% of that CPU. Using more threads just increases the synchronisation overhead. More threads would help only if the benchmark were not CPU bound.

Q: I have an X MHz CPU and a Y (X < Y) MHz CPU in my SMP system and the benchmark index is 2*X instead of X+Y!
A: Sandra 2000 (and earlier) uses a static load balancer, thus assigns the same amount of work on both CPUs. Thus, the benchmark time depends on the slowest CPU. Update to Sandra version 2001 or later and turn on the dynamic load balancer (see elsewhere).
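The arithmetic behind the 2*X versus X+Y result can be sketched as follows. This is a simplified model, assuming the index scales linearly with clock speed and ignoring all overheads; the function names are made up for the example.

```python
def static_index(speeds_mhz):
    """Equal work per CPU: the run ends only when the slowest CPU finishes,
    so every CPU effectively contributes at the slowest CPU's rate."""
    return len(speeds_mhz) * min(speeds_mhz)

def dynamic_index(speeds_mhz):
    """Work split in proportion to speed: all CPUs finish together,
    so each one contributes at its own rate."""
    return sum(speeds_mhz)

if __name__ == "__main__":
    cpus = [400, 500]                 # X = 400 MHz, Y = 500 MHz (example values)
    print(static_index(cpus))         # 800, i.e. 2*X
    print(dynamic_index(cpus))        # 900, i.e. X+Y
```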

Q: Should I use different speed CPUs in my SMP system?
A: Due to the above, we think no. While you may get away with it, your system becomes AMP (asymmetric) rather than SMP. This may freak out quite a few operating systems and software out there.

Q: Should I use non-SMT and SMT CPUs in my SMP system?
A: While this is theoretically possible, most likely you'll need to disable the parallel execution mode and run the sequential mode; most likely you'll also need to enable the dynamic load balance. There may well be some issues with affinity and thread allocation that would result in lower scores.

Q: When should I use the Parallel Execution Mode?
A: This mode is enabled by default on latest versions.

Q: When should I use the Dynamic Load Balance?
A: Always use the Static Load Balance unless the CPUs are running at different speeds internally. The calibration and synchronisation algorithms of the Dynamic Load Balance are more complex and have a higher overhead, thus they should be avoided. We've provided them for testing purposes only.
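For readers wondering what a calibrated split might look like in principle, here is a minimal sketch that divides a number of work units in proportion to measured CPU speeds. It is an assumption-laden illustration only: Sandra's actual calibration and synchronisation algorithms are not described in this document, and `split_work` is a hypothetical name.

```python
def split_work(total_units, speeds):
    """Divide total_units between CPUs in proportion to their calibrated speeds."""
    total_speed = sum(speeds)
    shares = [total_units * s // total_speed for s in speeds]
    # Integer division may leave a few units unassigned; give them to the fastest CPU.
    shares[speeds.index(max(speeds))] += total_units - sum(shares)
    return shares

if __name__ == "__main__":
    print(split_work(900, [400, 500]))   # [400, 500]
    print(split_work(860, [500, 500]))   # [430, 430], same as a static split
```

With identical CPUs the proportional split degenerates to the static one, which is consistent with the advice to prefer the static balancer unless the CPUs differ in speed.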

Q: Changing the CPU speed by SoftFSB has no effect on benchmarks on Windows NT4/2000/XP!
A: Update Windows 2000 to SP1 or later. Update Windows NT4 to SP6a or later.

Q: What is this CPU Multi-Media Benchmark? What does it do?
A: This benchmark generates a picture (860x750) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.
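For reference, the per-pixel work in a Mandelbrot generator of this kind looks roughly like the sketch below: a standard escape-time loop capped at 255 iterations. The mapping of the iteration count onto 32 colours by a simple modulo is an assumption for illustration, as is every name in the code; Sandra's own implementation is not shown in this document.

```python
MAX_ITER = 255   # iterations per pixel, as stated above
COLOURS = 32     # palette size, as stated above

def mandelbrot_iterations(cx, cy, max_iter=MAX_ITER):
    """Count iterations of z = z*z + c until |z| > 2, up to max_iter."""
    x = y = 0.0
    for n in range(max_iter):
        if x * x + y * y > 4.0:          # |z|^2 > 4 means the point has escaped
            return n
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter                      # never escaped: treated as inside the set

def colour_index(iterations):
    """Map an iteration count onto the palette (modulo mapping is an assumption)."""
    return iterations % COLOURS

if __name__ == "__main__":
    print(mandelbrot_iterations(0.0, 0.0))   # 255 (inside the set)
    print(mandelbrot_iterations(2.0, 2.0))   # 1 (escapes almost immediately)
```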

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assigns each thread to a different CPU.
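The interlacing scheme described above can be sketched with a shared "next column" counter that each thread reads and advances under a lock. This is a sketch under assumptions: the names are hypothetical, pinning threads to specific CPUs is not shown, and in CPython the global interpreter lock means these threads will not actually run CPU-bound work in parallel.

```python
import os
import threading

def render_columns(width, work, num_threads=None):
    """Run work(column) for every column, handing out columns one at a time."""
    num_threads = num_threads or os.cpu_count() or 1   # one thread per CPU
    next_column = 0
    lock = threading.Lock()
    owner = [None] * width        # records which thread computed each column

    def worker(thread_id):
        nonlocal next_column
        while True:
            with lock:            # claim the next column nobody is working on
                if next_column >= width:
                    return
                column = next_column
                next_column += 1
            work(column)
            owner[column] = thread_id

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return owner

if __name__ == "__main__":
    done = []
    render_columns(860, done.append, num_threads=4)
    print(len(done))              # 860: every column was computed exactly once
```

The lock is the kind of synchronisation overhead mentioned elsewhere in this document: it is small, but it grows with the number of threads competing for the counter.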

The benchmark contains 5 versions (ALU, MMX, MMX Enhanced, SSE and SSE2) that use integers to simulate floating point numbers, as well as 5 versions that use floating point numbers (FPU, 3DNow!, 3DNow! Enhanced, SSE and SSE2). This illustrates why older fractal generation programs used integers (e.g. the well-known FractInt): on 386/486 CPUs the integer version is 3-4x faster.
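"Using integers to simulate floating point numbers" usually means fixed-point arithmetic. A minimal sketch of the same escape-time loop in fixed point follows; the 16 fractional bits and all names are assumptions chosen for the example, not Sandra's actual format. Python integers never overflow, whereas a real ALU/MMX version would have to work within fixed register widths.

```python
FRAC_BITS = 16                 # assumed number of fractional bits
ONE = 1 << FRAC_BITS           # fixed-point representation of 1.0

def to_fixed(value):
    """Convert a float to fixed point."""
    return int(round(value * ONE))

def fixed_mul(a, b):
    """Multiply two fixed-point numbers and rescale the result."""
    return (a * b) >> FRAC_BITS

def mandelbrot_iterations_fixed(cx, cy, max_iter=255):
    """Escape-time loop using integer arithmetic only inside the loop."""
    cx, cy = to_fixed(cx), to_fixed(cy)
    x = y = 0
    four = 4 * ONE
    for n in range(max_iter):
        xx, yy = fixed_mul(x, x), fixed_mul(y, y)
        if xx + yy > four:     # |z|^2 > 4 means the point has escaped
            return n
        x, y = xx - yy + cx, 2 * fixed_mul(x, y) + cy
    return max_iter

if __name__ == "__main__":
    print(mandelbrot_iterations_fixed(0.0, 0.0))   # 255
    print(mandelbrot_iterations_fixed(2.0, 2.0))   # 1
```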

The (E)MMX, 3DNow! and SSE(2) versions compute 2/4/8 Mandelbrot point iterations at once - rather than one at a time - thus taking advantage of the SIMD instructions. Even so, a 2/4/8x improvement cannot be expected (due to other overheads); generally a 2.5-3x improvement has been achieved. The ALU & FPU of 6/7 generation processors are very advanced (e.g. 2+ execution units), thus bridging the gap as well. We found it useful to see the differences between the old and new versions of CPUs within a family as well as comparing similar CPUs from different manufacturers (e.g. Intel vs. AMD).

Q: How do I compute the computed pixel rate from the index? (aka how fast is the algorithm?)
A: The image rendered is 860x750 pixels, 32 colours. Computed pixel rate = 860*750*index/1000. Thus, for example, if the index is 1,000, the pixel rate = 860*750*1000/1000 = 645,000, i.e. 645k pixels/second.
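The formula above is simple enough to wrap in a helper. The function name is made up; the constants and the formula come straight from this answer.

```python
WIDTH, HEIGHT = 860, 750   # image size used by the benchmark

def pixel_rate(index):
    """Computed pixel rate in pixels/second: 860*750*index/1000."""
    return WIDTH * HEIGHT * index / 1000

if __name__ == "__main__":
    print(pixel_rate(1000))   # 645000.0
```

For an index of 1,000 this evaluates to 645,000 pixels/second, since 860*750 = 645,000.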

Q: Isn't comparing an ALU index to a (E)MMX/SSE(2)/etc. index like comparing apples to oranges?
Q: Isn't comparing a FPU index to a 3DNow!/SSE(2)/etc. index like comparing apples to oranges?
A: It depends on what you're trying to test; the index shows what gain the new instructions bring in getting the task (computing the Mandelbrot fractal in this case) done. If you want to test how two processors perform using the same test, go to Options and disable the test(s) that use(s) the more advanced instructions.

Q: How am I supposed to know what (kind of) test was run?
A: Pay attention to the result bar; it should tell you all about the test as well as the result in it/s. It should say:

Type of unit(s) used. E.g. ALU or FPU.
Type of data used. E.g. integer or floating-point.
Any instruction sets used. E.g. MMX, EMMX, SSE, SSE2, etc.
The score in it/s. (see above how to calculate pixel rate)

Q: Are the tests in the CPU Multi-Media Benchmark optimised for a specific CPU?
A: Yes, the tests are optimised as far as possible but without introducing instructions that would generate large penalties on other processors.

ALU (Integer) Test - Optimised for Intel Pentium core.
FPU (Floating Point) Test - Optimised for Intel Pentium core.
MMX (Integer) Test - Optimised for Intel Pentium MMX core.
Enhanced MMX (Integer) Test - Optimised for AMD Athlon.
SSE (Integer & Floating Point) Test - Optimised for Intel Pentium III.
SSE2 (Integer & Floating Point) Test - Optimised for Intel Pentium 4.
3DNow! (Floating Point) Test - Optimised for AMD K6.
3DNow! Enhanced (Floating Point) Test - Optimised for AMD Athlon.

Q: Will you optimise the benchmarks for other CPUs, e.g. Intel Itanium, AMD Hammer etc?
A: No. Each version is targeted at an instruction family, not at a specific processor. There will never be, say, a MMX K6-2, MMX Cyrix and a MMX PII version. However, if a processor introduces new instructions that make a big difference to the algorithm, we will write a new version of the test to use the new instructions.

Q: Why 3DNow! Enhanced and/or SSE integer benchmarks?
A: Both add a very useful set of data comparison & manipulation instructions which do help in achieving higher speeds.

Q: My AMD K6-2+, K6-III+ CPU does not run the 3DNow! Enhanced benchmark!
A: This benchmark also uses MMX Enhanced instructions, which are not implemented in the K6-2+/III+ CPUs, thus the test cannot run on them. The instructions are used for speed-ups.

Q: My VIA Cyrix III, IDT WinChip-2/3 runs slow on the 3DNow! benchmark!
A: These processors offer great value for money and thus are not designed to match the AMD K6-X running at the same speed. They offer other benefits.

Q: How much faster is MMX on a P5/P6 class CPU in the CPU Multi-Media Benchmark?
A: On a P5-class processor (e.g. Pentium MMX) the MMX test runs about 170% (i.e. 2.7x) faster than the ALU test. It is 194% (i.e. 3x) faster than the FPU test.

On a P6-class processor (e.g. Pentium II/III) the MMX test runs only 24% (i.e. 1.24x) faster than the ALU test as there are multiple, superpipelined ALUs built-in.

On a P7-class processor (e.g. Athlon, Pentium 4) the MMX test gets similar indexes to the ALU test. This does not mean that MMX is useless now, just that it has been overtaken by other instruction sets.
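The "N% faster" and "Nx" figures in this answer are two ways of writing the same ratio. A one-line conversion, with a hypothetical function name, makes the relationship explicit:

```python
def speedup_factor(percent_faster):
    """Convert 'N% faster' into a multiplier, e.g. 170% faster -> 2.7x."""
    return 1.0 + percent_faster / 100.0

if __name__ == "__main__":
    for pct in (170, 194, 24):
        print(f"{pct}% faster = {speedup_factor(pct):.2f}x")
```

Note that 194% faster works out to 2.94x, which the answer rounds to 3x.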

Q: My Pentium 4 gets beaten by a Pentium III in MMX/SSE!
A: Update to Sandra 2001 or later with SSE2 support.

Q: My Pentium (II/III/4) Xeon is not faster than the Pentium (II/III/4)! Why?
A: The benchmark code tests the core of the CPU only, thus it fits in the L1 cache and does not depend on the L2/L3 cache size.

Q: My Coppermine Pentium III is not much faster than a Katmai Pentium III!
A: The Coppermine does not have lower SSE latencies than the Katmai; its ATC L2 cache is the only change that provides any real benefit.

Q: The SSE(2) benchmark crashes on my system!
A: If you're running Windows NT, install SP6. If you're running Windows 95 you'll need to upgrade to Windows 98/Me/2000/XP that support the new SSE(2) state.

Q: I have a PIII and PII in my SMP system and SSE benchmark crashes!
A: Make sure the PII is the boot (BSP) CPU. The CPU with the lowest function support or lower version should always be the boot processor. Thus software will use only the features supported by all processors.
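The rule above amounts to restricting software to the intersection of the feature sets of all installed CPUs. A small sketch of that idea follows; the feature names and function names are illustrative assumptions, not values read from real hardware.

```python
def common_features(cpu_feature_sets):
    """Features that every CPU in the system supports."""
    sets = [set(f) for f in cpu_feature_sets]
    return set.intersection(*sets) if sets else set()

def boot_cpu_candidate(cpu_feature_sets):
    """Index of the first CPU whose features are exactly the common set, or None."""
    common = common_features(cpu_feature_sets)
    for i, features in enumerate(cpu_feature_sets):
        if set(features) == common:
            return i
    return None

if __name__ == "__main__":
    piii = {"FPU", "MMX", "SSE"}      # illustrative feature lists
    pii = {"FPU", "MMX"}
    print(sorted(common_features([piii, pii])))   # ['FPU', 'MMX']
    print(boot_cpu_candidate([piii, pii]))        # 1, i.e. the PII
```

With the PII as boot processor, software that queries the boot CPU sees no SSE support and so never issues SSE instructions on either CPU.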

Q: I have 2 CPUs with different steppings in my SMP system and the benchmarks are weird!
A: Make sure the CPU with the lowest stepping is the boot (BSP) CPU. Thus software will use only the features/timings supported by all processors.

Q: My Cyrix 6x86/MX/MII/MIII or AMD K5, K6-X get very low FPU scores!
A: The built-in FPU in these processors is not pipelined like those in Intel P6+ CPUs or the AMD Athlon. The actual FPU performance is roughly on a par with the original Pentium running at the same clock rate, and the scores are therefore lower than the performance rating suggests.

Q: Why doesn't the benchmark include my super-duper XXXXGHz CPU?
A: While we do buy and test each and every CPU model on the market, we cannot afford to buy all the very latest speed grades of each CPU. Even if we did, we cannot update the benchmark when a new speed grade is released - we'd need to do it every week.